Arabic Dialect Processing Tutorial

Authors

  • Mona T. Diab
  • Nizar Habash
Abstract

The existence of dialects for any language constitutes a challenge for NLP in general, since it adds another set of variation dimensions from a known standard. The problem is particularly interesting and challenging in Arabic and its different dialects, where the divergence from the standard could, in some linguistic views, warrant a classification as different languages. This problem would not be as pronounced if Modern Standard Arabic (MSA) were the native language of some subgroup of Arabic speakers; however, it is not. Any realistic and practical approach to processing Arabic will have to account for dialectal usage, since it is so pervasive. In this tutorial, we will attempt to highlight different dialectal phenomena, how they deviate from the standard, and why they pose challenges to NLP. This area of research (dialects in general and Arabic dialects in particular) is gaining a lot of interest. For example, the DARPA-funded BOLT program starting this year will only consider dialectal varieties for its effort on Arabic. Furthermore, there was a workshop on dialect processing as part of EMNLP 2011. This tutorial has four parts. First, we contextualize the question of Arabic dialects from a sociolinguistic and political perspective. Second, we present a discussion of issues relevant to Arabic NLP; this includes generic issues common to MSA and dialects, and MSA-specific issues. In the third part, we detail dialectal linguistic issues and contrast them with MSA issues. In the last part, we review the state of the art in Arabic dialect processing, covering several enabling technologies and applications, e.g., dialect identification, speech recognition, morphological processing (analysis, disambiguation, tokenization, POS tagging), parsing, and machine translation. Throughout the presentation we will make references to the different resources available and draw contrastive links with standard Arabic and English. Moreover, we will discuss annotation standards as exemplified in the Treebank. We will provide links to recent publications and available toolkits/resources for all four sections. This tutorial is designed for computer scientists and linguists alike. No knowledge of Arabic is required, though we recommend taking a look at Nizar Habash's Arabic NLP tutorial (http://www1.ccls.columbia.edu/~cadim/presentations.html), which will be reviewed as part of the tutorial.

Similar resources

Borrowing the Verb “ast” and Its Varieties in Arabic Dialect of Sarab

“Borrowing” is a linguistic process studied in diachronic linguistics, in which a language borrows elements from another language. It usually occurs in areas where two languages come into contact. Language borrowing occurs in a dialect spoken in South Khorasan. The Arabs living in this part of Iran probably immigrated in the early centuries of Islam. In thi...

The Status of [h] and [ʔ] in the Sistani Dialect of Miyankangi

The purpose of this article is to determine the phonemic status of [h] and [ʔ] in the Sistani dialect of Miyankangi. Auditory tests applied to the relevant data show that [ʔ] occurs mainly in word-initial position, where it stands in free variation with Ø. The only place where [h] is heard is in Arabic and Persian loanwords, and only in the pronunciation of some speakers who are educated and/or...

Improved Arabic Dialect Classification with Social Media Data

Arabic dialect classification has been an important and challenging problem for Arabic language processing, especially for social media text analysis and machine translation. In this paper we propose an approach to improving Arabic dialect classification with semi-supervised learning: multiple classifiers are trained with weakly supervised, strongly supervised, and unsupervised data. Their comb...

Grapheme to phoneme conversion: an Arabic dialect case

We aim to develop a Speech-to-Speech translation system between Modern Standard Arabic and the Algiers dialect. Such a system must include a Text-to-Speech module, which itself must include a Grapheme-to-Phoneme converter. The Algiers dialect is an Arabic dialect affected by most of the problems that Modern Standard Arabic poses for NLP. Furthermore, it could be considered as an under-resourced language becau...

Diacritics restoration for Arabic dialect texts

Vocalization, diacritization, or diacritics restoration is one of the major challenges in Arabic natural language processing. The Algiers dialect is also affected by this issue. In this paper, we present an automatic diacritization system for standard and dialectal Arabic texts based on a statistical approach. The idea is to use available tools in statistical machine translation to build such a system...

Publication date: 2007